Skip to content

HDDS-14041. Add metrics to track Snapshot RocksDB space and SST File stats.#9406

Merged
jojochuang merged 9 commits intoapache:masterfrom
sadanand48:HDDS-14041
Apr 6, 2026
Merged

HDDS-14041. Add metrics to track Snapshot RocksDB space and SST File stats.#9406
jojochuang merged 9 commits intoapache:masterfrom
sadanand48:HDDS-14041

Conversation

@sadanand48
Copy link
Copy Markdown
Contributor

@sadanand48 sadanand48 commented Dec 1, 2025

What changes were proposed in this pull request?

Add metrics to track Snapshot RocksDB space and SST File stats

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14041

How was this patch tested?

Tried out on docker

  "beans" : [ {
    "name" : "Hadoop:service=OzoneManager,name=OMSnapshotDirectoryMetrics",
    "modelerType" : "OMSnapshotDirectoryMetrics",
    "tag.Context" : "Snapshot Directory Metrics",
    "tag.Hostname" : "b8f9c29c90ba",
    "DbSnapshotsDirSize" : 1162516,
    "TotalSstFilesCount" : 36,
    "NumSnapshots" : 3
  } ]
}

@sadanand48 sadanand48 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Dec 1, 2025
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds comprehensive metrics tracking for Ozone Manager's snapshot RocksDB directories, monitoring disk space usage and SST file counts. The implementation introduces a new metrics class that periodically collects statistics both at the aggregate level (total snapshots directory size and SST file count) and per-checkpoint directory level, with configurable update intervals.

Key Changes

  • Introduces OMSnapshotDirectoryMetrics class with asynchronous metrics collection using a Timer-based scheduler
  • Adds configuration property ozone.om.snapshot.directory.metrics.update.interval (default: 5 minutes) to control update frequency
  • Integrates metrics lifecycle into OzoneManager's start/restart/stop methods

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
OMSnapshotDirectoryMetrics.java New metrics class implementing periodic collection of snapshot directory size and SST file statistics with both aggregate and per-checkpoint metrics
OzoneManager.java Integrates snapshot directory metrics into OM lifecycle by starting metrics collection on start/restart and stopping on shutdown
OMMetrics.java Adds snapshot directory metrics management methods and field to hold the metrics instance
OMConfigKeys.java Defines configuration key and default value for metrics update interval
ozone-default.xml Adds configuration property documentation for the metrics update interval setting

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jojochuang
Copy link
Copy Markdown
Contributor

@sadanand48 also I updated the jira release note. PTAL

@sadanand48 sadanand48 marked this pull request as ready for review December 10, 2025 08:33
@jojochuang jojochuang requested a review from Copilot December 12, 2025 20:39
Copy link
Copy Markdown
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Just want to see what copilot says but it seems to get stuck.

Copy link
Copy Markdown
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch @sadanand48 have some nitpicky review comments on the code modularity and design of this. Let us make this change to beautify the code a bit so that it would be easier to add more metrics later on.

*/
@InterfaceAudience.Private
@Metrics(about = "OM Snapshot Directory Metrics", context = OzoneConsts.OZONE)
public final class OMSnapshotDirectoryMetrics implements MetricsSource {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having inner static classes. I would suggest to create snapshot.metrics package have these package private classes. It would make the code more readable.

Copy link
Copy Markdown
Contributor

@swamirishi swamirishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have a few doubts left comments inline

*/
private long calculateDirSizeAccountingForHardlinks(File directory)
throws IOException {
Set<Object> visitedInodes = new HashSet<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a globalSet ? We can make this class variable and clear at the end of each run to reuse

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the directory here is the db.snapshots/checkpointState dir that has all checkpoint dirs so we should be good


// Add per-checkpoint-directory metrics from cached map
Map<String, CheckpointMetrics> currentMetrics = checkpointMetricsMap;
for (Map.Entry<String, CheckpointMetrics> entry : currentMetrics.entrySet()) {
Copy link
Copy Markdown
Contributor

@swamirishi swamirishi Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currentMetrics is a HashMap order would be indeterministic. We should use a treeMap to ensure this

Comment thread hadoop-hdds/common/src/main/resources/ozone-default.xml
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread hadoop-hdds/common/src/main/resources/ozone-default.xml Outdated
@github-actions
Copy link
Copy Markdown

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions Bot added the stale label Jan 10, 2026
@github-actions
Copy link
Copy Markdown

Thank you for your contribution. This PR is being closed due to inactivity. If needed, feel free to reopen it.

@github-actions github-actions Bot closed this Jan 17, 2026
@jojochuang jojochuang reopened this Apr 2, 2026
@github-actions github-actions Bot removed the stale label Apr 3, 2026
Copy link
Copy Markdown
Contributor

@jojochuang jojochuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. merging it.

@jojochuang jojochuang merged commit 5c53d2e into apache:master Apr 6, 2026
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

snapshot https://issues.apache.org/jira/browse/HDDS-6517

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants